Andrej Karpathy recently released a lecture that breaks down tokenization, a key step in training large language models (LLMs). Tokenization is when text is split into smaller pieces (called tokens), which are then fed into the model. The process usually involves algorithms like Byte Pair Encoding (BPE) to train tokenizers using specific datasets.

In this lecture, Karpathy shows how to build a GPT tokenizer from scratch and explains some odd behaviors in LLMs that are caused by tokenization.

Here are some examples where tokenization is the root cause of common LLM issues:

- Why can’t LLMs spell words correctly? Tokenization.
- Why can’t LLMs do simple string tasks like reversing a string? Tokenization.
- Why are LLMs worse at handling non-English languages like Japanese? Tokenization.
- Why are LLMs bad at basic arithmetic? Tokenization.
- Why did GPT-2 struggle more than expected with Python code? Tokenization.
- Why does my LLM stop when it sees "<endoftext>"? Tokenization.
- What’s with the weird warning about "trailing whitespace"? Tokenization.
- Why does the LLM break if I ask about “SolidGoldMagikarp”? Tokenization.
- Why is YAML often better than JSON for LLMs? Tokenization.
- Why isn’t LLM really doing end-to-end language modeling? Tokenization.
- What’s the root cause of all these issues? Tokenization.

To make LLMs work better, you need to understand how tokenization affects them. While tokenization doesn’t come up much during inference (beyond setting max_tokens), prompt engineering depends on understanding tokenization limitations. If your prompt doesn’t perform well, it might be because a word, acronym, or concept wasn’t tokenized correctly—something many LLM developers overlook.

One tool commonly used for tokenization, and featured in Karpathy's lecture, is Tiktokenizer. It’s great for handling tokenization tasks in practical applications.